
PATTERN-MATCHING RULES FOR THE RECOGNITION OF
NATURAL LANGUAGE DIALOGUE EXPRESSIONS

Kenneth Mark Colby
and
Roger C. Parkison


INTRODUCTION

     To recognize something is to identify it as an instance of the
"same again".  This familiarity is possible because of recurrent
characteristics of the world.  We shall describe an algorithm which
recognizes recurrent characteristics of natural language dialogue
expressions.  It uses a multi-stage sequence of pattern-matching rules
to transform an input expression progressively until it matches an
abstract stored pattern.  The stored pattern has a pointer to a
response function in memory which decides what to do once the input
has been recognized.  Here we discuss only the recognizing functions,
except for one response function (anaphoric substitution) which
interactively aids the recognition process.  Details of how the
response functions operate will be described in a future communication
by William Faught and ourselves.
     We are constructing and testing a simulation of paranoid thought
processes; our problem is to reproduce paranoid linguistic behavior in
a teletyped diagnostic psychiatric interview.  The diagnosis of
paranoid states, reactions, or modes is made by clinicians who judge
the degree of correspondence between what they observe in an interview
and their conceptual model of paranoid behavior.  There is a high
degree of agreement among psychiatrists about this conceptual model,
which relies mainly on what an interviewee says and how he says it.
     Natural language is a life-expressing code which people use for
communication with themselves and others.  In a real-life dialogue
such as a psychiatric interview, the participants have interests,
intentions, and expectations which are revealed in their linguistic
expressions.  An interactive simulation of a paranoid patient must be
able to demonstrate typical paranoid linguistic behavior.  To achieve
this effect, our paranoid model must be able to deal with the
teletyped messages of an interviewer.
     A number of approaches have been taken to dealing with natural
language dialogue expressions (Winograd, 1972; Woods, 1970).  These
approaches rely on parsers which conduct a detailed syntactic and
semantic analysis, and they perform well for the purposes for which
they were designed.  Their weakness, for our purposes, lies in their
lack of neglecting and ignoring mechanisms.  Such mechanisms are
necessary in a program which accepts and responds to unrestricted
conversational English characterized by expressions novel to the
program.
     How humans process natural language is largely unknown.  They
possess some knowledge of grammatical rules, but this fact does not
entail that they use a grammar in interpreting and producing language.
It seems implausible to us that people possess full transformational
grammars for processing language.  Language is what is recognized, but
the processes involved may not be linguistic or grammatical.
Transformational grammars were not originally designed to "understand"
a large subset of English; they constituted a formal method for
deciding whether a string is grammatical.
     An analysis of what one's problem actually is should guide the
selection or invention of methods appropriate to its solution.  Our
problem is not to develop a consistent and general theory of language,
nor to assert empirically testable hypotheses about how people process
language.  Our problem is to design an algorithm which recognizes what
is being said in a dialogue and what is being said about it, in order
to make a response such that a sample of input-output (I-O) pairs from
the paranoid model is judged similar to a sample of I-O pairs from
paranoid patients.  The design task belongs to artificial
intelligence, in which the criterion is how adequately the computer
program performs mind-like functions.  New methods had to be devised
for an algorithm to participate in a human dialogue in a
paranoid-patient-like way.  We sought effective methods which could
operate efficiently in real time.  Since our method provides a general
many-to-one mapping from surface expressions to a single stored
pattern, it is not limited to the simulation of paranoia, but can be
used by any type of "host" system which takes natural language as
input.
     Our method is to transform the input until a pattern is obtained
which matches, completely or partially, a more abstract stored
pattern.  This strategy has proved adequate for our purposes a
satisfactory percentage of the time.  The power of this method for
natural language dialogues lies in its ability to ignore as irrelevant
some of what it recognizes and everything it does not recognize at
all.  A linguistic parser doing word-by-word, parts-of-speech analysis
fails when it cannot find one or more of the input words in its
dictionary.  A system that must know every word is too fragile for
unrestricted dialogues.
     In early versions of the paranoid model, such as PARRY1, some of
the pattern-recognition mechanisms allowed the elements of the pattern
to be order-independent (Colby, Weber, and Hilf, 1971).  For example,
consider the following expressions:
          (1)  WHERE DO YOU WORK?
          (2)  WHAT SORT OF WORK DO YOU DO?
          (3)  WHAT IS YOUR OCCUPATION?
          (4)  WHAT DO YOU DO FOR A LIVING?
          (5)  WHERE ARE YOU EMPLOYED?
     In PARRY1 a procedure scans these expressions looking for an
information-bearing contentive such as "work", "for a living", etc.
When it finds such a contentive along with "you" or "your" in the
expression, regardless of word order, it responds to the expression as
if it were a question about the nature of one's work.  This method
correctly classifies the five sentences above.  Unfortunately, it also
places the two examples below in the same category:
          (6)  DOES YOUR FATHER'S CAR WORK?
          (7)  HOW DID THINGS WORK OUT FOR YOU?
An insensitivity to word order has the advantage that lexical items
representing different parts of speech can represent the same concept,
e.g. the word "work" represents the same concept whether it is used as
a noun or a verb.  But a price is paid for this resilience and
elasticity.  We find from experience that, since English relies
heavily on word order to convey the meaning of its messages, the
average penalty of misunderstanding (to be distinguished from
ununderstanding) is too great.  Hence in PARRY2, as will be described
shortly, all the patterns require a specified word order.
     For high-complexity problems it is helpful to have constraints.
Diagnostic psychiatric interviews (and especially those conducted over
teletypes) have several natural constraints.  First, clinicians are
trained to ask certain questions in certain ways.  This limits the
number of patterns required to recognize utterances about each topic.
Second, only a few hundred standard topics are brought up by
interviewers, who are, furthermore, trained to use everyday
expressions, especially those used by the patient himself.  When the
interview is conducted by teletype, expressions tend to be shortened,
since the interviewer tries to increase the information transmission
rate over the slow channel of a teletype.  Finally, teletyped
interviews consist of written utterances, which are known to be so
highly redundant that unrecognized words can be ignored without losing
the meaning of the message.  Utterances are also loaded with idioms,
cliches, pat phrases, etc., all easy prey for a pattern-matching
approach.  It is time-wasting and usually futile to try to decode an
idiom by analyzing the meanings of its individual words.
     We now describe the pattern-matching functions of the algorithm
in some detail.  (See Fig. 1 for a diagram of the overall flow of
control.)


OVERVIEW

     PARRY2 has two primary modules.  The first attempts to RECOGNIZE
the input and the second RESPONDS.  This paper is primarily about the
RECOGNIZE module.  It functions independently of the RESPOND module
except in the case of pronoun references, which the RESPOND module
provides to the RECOGNIZER on request.
     The recognition module has four main steps:
     1) Identify the words in the question and convert them to
        internal synonyms.
     2) Break the input into segments at certain bracketing words.
     3) Match each segment (independently) to a stored pattern.
     4) Match the resulting list of recognized segments to a stored
        compound pattern.
Each of these steps, except the segmenting, throws away what it cannot
identify.  Occasionally a reference to an unknown topic is
mis-recognized as some familiar topic.
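     To make the flow concrete, the sketch below traces the four steps
on a toy scale.  It is our illustration only, written in Python rather
than the MLISP of the actual program; the dictionary, bracketing
terms, and stored pattern are stand-ins for the real data described in
the sections that follow.

    # Illustrative four-step recognition pipeline (toy data throughout).
    SYNONYMS = {"WHAT": "WHAT", "IS": "BE", "YOUR": "YOU",
                "OCCUPATION": "JOB"}                      # step 1 data
    BRACKETS = {"IN", "ON", "TO", "WHY", "WHERE"}         # step 2 delimiters
    SIMPLE = {("WHAT", "BE", "YOU", "JOB"): "ASK-JOB"}    # step 3 patterns

    def recognize(text):
        # Step 1: keep only dictionary words, translated to internal synonyms.
        words = [SYNONYMS[w] for w in text.upper().rstrip("?.").split()
                 if w in SYNONYMS]
        # Step 2: break the word list into segments at bracketing terms.
        segments, current = [], []
        for w in words:
            if w in BRACKETS and current:
                segments.append(tuple(current))
                current = []
            current.append(w)
        if current:
            segments.append(tuple(current))
        # Step 3 (and, for compound inputs, step 4): match the segments.
        return [SIMPLE.get(seg) for seg in segments]

    print(recognize("WHAT IS YOUR OCCUPATION?"))  # -> ['ASK-JOB']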

PREPROCESSING

     Each word in the input expression is first looked up in a
dictionary of (currently) about 1900 entries which, for the sake of
speed, is maintained in core during run-time.  (The dictionary is
given in Appendix 2.)  The dictionary, which was built empirically
from thousands of teletyped interviews with previous versions of the
model, consists of words, groups of words, and the names of the
word-classes they can be translated into.  The size of the dictionary
is determined by the patterns, i.e. a word is not included unless it
appears in one or more patterns.  Entries in the dictionary reflect
PARRY2's main interests.  If a word in the input is not in the
dictionary, it is checked to see if it ends with one of the common
suffixes given in Fig. 2.  If it does, the suffix is removed and the
remaining word is looked up again.  If it is still not in the
dictionary, it is dropped from the pattern being formed.  Thus if the
input is:
          WHAT IS YOUR CURRENT OCCUPATION?
and the word "current" is not in the dictionary, the pattern at this
stage becomes:
          ( WHAT IS YOUR OCCUPATION )
The question mark is thrown away as redundant, since questions are
recognized by word order.  (A statement followed by a question mark
(YOU GAMBLE?) is responded to in the same way as that statement
followed by a period.)  Synonymic translations of words are then made,
so that the pattern becomes, for example:
          ( WHAT BE YOU JOB )
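     The lookup, suffix-stripping, and synonym-translation steps can
be pictured as follows.  This is a sketch under assumed toy data: the
dictionary here has five entries rather than 1900, and the suffix
tuple merely suggests the list of Fig. 2.

    # Sketch of dictionary lookup with suffix stripping (toy data).
    DICTIONARY = {"WHAT": "WHAT", "IS": "BE", "YOUR": "YOU",
                  "OCCUPATION": "JOB", "WORK": "JOB"}
    SUFFIXES = ("ING", "ED", "S")      # assumed stand-in for Fig. 2

    def translate(word):
        if word in DICTIONARY:
            return DICTIONARY[word]
        for suffix in SUFFIXES:        # retry with the suffix removed
            if word.endswith(suffix) and word[:-len(suffix)] in DICTIONARY:
                return DICTIONARY[word[:-len(suffix)]]
        return None                    # still unknown: drop the word

    words = "WHAT IS YOUR CURRENT OCCUPATION".split()
    print([t for w in words if (t := translate(w)) is not None])
    # -> ['WHAT', 'BE', 'YOU', 'JOB']   ("CURRENT" is dropped)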
     Some groups of words (i.e. idioms) are translated as a group, so
that, for example, "for a living" becomes "for job".  Certain other
juxtaposed words are contracted into a single word, e.g. "place of
birth" becomes "birthplace".  This is done to deal with groups of
words which are represented as a single element in the stored pattern,
thereby preventing segmentation from occurring at the wrong places,
such as at a preposition inside an idiom or phrase.  Besides these
contractions, certain expansions are made so that, for example,
"DON'T" becomes "DO NOT" and "I'D" becomes "I WOULD".
     Misspellings can be the bane of teletyped interviews for an
algorithm.  Here they are handled in two ways.  First, common
misspellings of important words are simply put in the dictionary.
Thus "yuu" is known to mean "you".  The apostrophe is often omitted
from contractions, so most contractions are recognized with or without
it.  These common misspellings were gathered from over 4000 interviews
with earlier versions of the paranoid model.
     Second, five common forms of typing error are checked
systematically.  These are:
     1) Doubled letter
     2) Extraneous letter
     3) Forgetting to hold the "shift key" for an apostrophe
     4) Hitting a nearby key on the keyboard
     5) Transposing two letters in a word
The first three errors can be corrected by deleting the offending
character from the word.  This is accomplished by deleting each
character in turn until the word is recognized.  The fourth type of
error is checked for only eight of the more common near misses.  These
were also determined empirically and involve the letter pairs (T Y),
(Q W), (Y U), (I O), (G H), (O P), (A S), and (N M).  These methods
are all based on typing errors, but they also correct some legitimate
English spelling errors.  Two-letter transposition corrects, for
example, "beleive" to "believe".
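     The five checks lend themselves to a simple generate-and-test
loop: propose each one-character repair and accept the first candidate
found in the dictionary.  The sketch below is our reconstruction, with
a toy dictionary; the real program tests candidates against its full
1900-entry dictionary, and is called only for words that have already
failed ordinary lookup.

    # Sketch of the five typing-error checks (our reconstruction).
    DICTIONARY = {"YOU", "BELIEVE"}
    PAIRS = [("T", "Y"), ("Q", "W"), ("Y", "U"), ("I", "O"),
             ("G", "H"), ("O", "P"), ("A", "S"), ("N", "M")]

    def correct(word):
        # Errors 1-3: delete each character in turn (doubled letter,
        # extraneous letter, stray character in place of an apostrophe).
        for i in range(len(word)):
            w = word[:i] + word[i + 1:]
            if w in DICTIONARY:
                return w
        # Error 4: a nearby key, checked for the eight common pairs.
        for i, c in enumerate(word):
            for a, b in PAIRS:
                if c in (a, b):
                    w = word[:i] + (b if c == a else a) + word[i + 1:]
                    if w in DICTIONARY:
                        return w
        # Error 5: two adjacent letters transposed ("BELEIVE").
        for i in range(len(word) - 1):
            w = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            if w in DICTIONARY:
                return w
        return None                    # uncorrectable: the word is dropped

    print(correct("YOOU"), correct("YIU"), correct("BELEIVE"))
    # -> YOU YOU BELIEVE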

SEGMENTING

     Another weakness in the crude pattern matching of PARRY1 is that
it takes the entire input expression as its basic processing unit.  If
only two words are recognized in an eight-word utterance, the risk of
misunderstanding is great.  We need a way of dealing with units
shorter than the entire input expression.
     Aided by a heuristic from work in machine translation (Wilks,
1973), we devised a way of bracketing the pattern constructed up to
this point into shorter segments, using prepositions, wh-forms,
certain verbs, etc. as bracketing points.  (A list of the bracketing
terms appears in Fig. 3.)  These points tend to separate prepositional
phrases and embedded clauses from the main clause.  The new pattern
formed is termed either "simple", having no delimiters within it, or
"compound", i.e. made up of two or more simple patterns.  A simple
pattern might be:
          ( WHAT BE YOU JOB )
whereas a compound pattern would be:
          (( WHY BE YOU ) ( IN HOSPITAL ))
Our experience with this method of segmentation shows that compound
patterns from teletyped psychiatric dialogues rarely consist of more
than three or four segments.
     After certain verbs (see Fig. 4) a bracketing occurs to replace
the commonly omitted "THAT", so that:
          ( I THINK YOU BE AFRAID )
becomes
          (( I THINK ) ( YOU BE AFRAID ))
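     A sketch of the segmenter, assuming toy stand-ins for the lists
of Figs. 3 and 4:

    # Sketch of segmentation (our stand-ins for Figs. 3 and 4).
    BRACKET = {"IN", "ON", "TO", "FOR", "WHY", "WHAT", "WHERE"}
    THAT_VERBS = {"THINK", "BELIEVE", "KNOW"}   # take an omitted "THAT"

    def segment(words):
        segments, current = [], []
        for w in words:
            # Open a new segment at a bracketing term, or just after a
            # Fig.-4 verb (standing in for the omitted "THAT").
            if current and (w in BRACKET or current[-1] in THAT_VERBS):
                segments.append(current)
                current = []
            current.append(w)
        if current:
            segments.append(current)
        return segments

    print(segment("WHY BE YOU IN HOSPITAL".split()))
    # -> [['WHY', 'BE', 'YOU'], ['IN', 'HOSPITAL']]
    print(segment("I THINK YOU BE AFRAID".split()))
    # -> [['I', 'THINK'], ['YOU', 'BE', 'AFRAID']]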

MATCHING INDIVIDUAL SEGMENTS

     Conjunctions serve only as markers for the segmenter; they are
dropped after segmentation.
     Negations are handled by extracting the "NOT" from the segment
and assigning a value to a global variable which indicates that the
expression is negative in form.  When a pattern is finally matched,
this variable is consulted.  Some patterns have a pointer to a pattern
of opposite meaning if a "NOT" could reverse their meanings.  If this
pointer is present and a "NOT" was found, then the pattern matched is
replaced by its opposite, e.g. ( I NOT TRUST YOU ) is replaced by the
pattern ( I MISTRUST YOU ).  We have not yet observed the troublesome
case of "he gave me not one but two messages".  (There is no need to
scratch where it doesn't itch.)
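     In outline, the negation rule might look like this (a sketch; the
opposite-meaning pointer is shown as a toy table):

    # Sketch of NOT-extraction with opposite-meaning patterns (toy data).
    OPPOSITE = {("I", "TRUST", "YOU"): ("I", "MISTRUST", "YOU")}

    def resolve_negation(segment):
        negated = "NOT" in segment                 # the "negative form" flag
        segment = tuple(w for w in segment if w != "NOT")
        if negated and segment in OPPOSITE:        # reverse the meaning
            return OPPOSITE[segment]
        return segment

    print(resolve_negation(("I", "NOT", "TRUST", "YOU")))
    # -> ('I', 'MISTRUST', 'YOU')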
     Substitutions are also made in certain cases.  Some segments
contain pronouns which could stand for a number of different things of
importance to PARRY2.  As we mentioned in the introduction, the
response functions of memory keep track of the context in order to
give pronouns and other anaphoras a correct interpretation.  For
example, the segment:
          ( DO YOU AVOID THEM )
could refer to the Mafia, or racetracks, or other patients, depending
on the context.  When such a segment is encountered, the pronoun is
replaced by its current anaphoric value as determined by the response
functions, and a more specific segment such as:
          ( DO YOU AVOID MAFIA )
is looked up.
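     The substitution itself is a one-for-one replacement; the real
work, deciding what "THEM" currently refers to, is done by the RESPOND
module.  A sketch:

    # Sketch of anaphoric substitution; the referent table would be
    # supplied, on request, by the RESPOND module.
    def substitute(segment, referents):
        return tuple(referents.get(w, w) for w in segment)

    print(substitute(("DO", "YOU", "AVOID", "THEM"), {"THEM": "MAFIA"}))
    # -> ('DO', 'YOU', 'AVOID', 'MAFIA')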
     Other utterances, such as "Why did you do that?" or just "Why?"
(which might be regarded as a massive ellipsis), clearly refer back to
previous utterances.  These utterances match very general patterns
which identify the type of question without indicating the exact
topic.  The response function which responds to "Why?" consults the
context to produce an appropriate answer.
     The algorithm next attempts to match the segments with stored
simple patterns, which currently number about 1700.  (The simple
patterns appear in Appendix 3.)  First a complete and perfect match is
sought.  When a match is found, the stored pattern name has a pointer
to the name of a response function in memory which decides what to do
next.  If a match is not found, further transformations of the segment
are carried out and a "fuzzy" match is tried.
     For fuzzy matching at this stage, we adopted the heuristic rule
of dropping elements in the segment one at a time and attempting a
match each time.  This heuristic allows familiar words to be ignored
in unfamiliar contexts.  For example, "well" is important in "Are you
well?" but meaningless in "Well, are you?".
     Deleting one element at a time results in, for example, the
pattern:
          ( WHAT BE YOU MAIN PROBLEM )
becoming successively:
          (a) ( BE YOU MAIN PROBLEM )
          (b) ( WHAT YOU MAIN PROBLEM )
          (c) ( WHAT BE MAIN PROBLEM )
          (d) ( WHAT BE YOU PROBLEM )
          (e) ( WHAT BE YOU MAIN )
Since the stored pattern in this case matches (d), (e) would not be
constructed.  We found it unwise to delete more than one element,
since our segmentation method usually yields segments containing a
small number (1-4) of words.
     Dropping one element at a time provides a probability threshold
for fuzzy matching which is a function of the length of the segment.
If a segment consists of five elements, four of the five must be
present in a particular order (with the fifth element missing in any
position) for a match to occur.  If a segment contains four elements,
three must match, and so forth.
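     The drop-one rule is easily stated as a procedure.  The sketch
below is ours, with a single toy stored pattern; it reproduces the
example above, stopping at deletion (d):

    # Sketch of the fuzzy match: perfect match first, then drop one
    # element at a time and retry (toy pattern store).
    PATTERNS = {("WHAT", "BE", "YOU", "PROBLEM"): "ASK-PROBLEM"}

    def fuzzy_match(segment):
        if segment in PATTERNS:                    # complete, perfect match
            return PATTERNS[segment]
        for i in range(len(segment)):              # drop element i, retry
            reduced = segment[:i] + segment[i + 1:]
            if reduced in PATTERNS:
                return PATTERNS[reduced]
        return None                                # no match found

    print(fuzzy_match(("WHAT", "BE", "YOU", "MAIN", "PROBLEM")))
    # -> ASK-PROBLEM   (found at deletion (d); (e) is never built)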

COMPOUND-PATTERN MATCH

     When more than one simple pattern is detected in the input, a
second matching is attempted against about 500 stored compound
patterns.  Certain simple patterns, such as ( HELLO ) and ( I THINK ),
are first dropped because they are considered meaningless.  If a
complete match is not found, then simple patterns are dropped, one at
a time, from the compound pattern.  This allows the input:
          (( HOW DO YOU COME ) ( TO BE ) ( IN HOSPITAL ))
to match the stored pattern:
          (( HOW DO YOU COME ) ( IN HOSPITAL ))
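     The compound match repeats the same drop-one heuristic a level
up, over whole segments.  A sketch with toy data:

    # Sketch of the compound-pattern match (toy data).
    MEANINGLESS = {("HELLO",), ("I", "THINK")}
    COMPOUND = {(("HOW", "DO", "YOU", "COME"),
                 ("IN", "HOSPITAL")): "ASK-WHY-HOSPITAL"}

    def match_compound(segments):
        segments = tuple(s for s in segments if s not in MEANINGLESS)
        if segments in COMPOUND:                   # complete match
            return COMPOUND[segments]
        for i in range(len(segments)):             # drop one simple pattern
            reduced = segments[:i] + segments[i + 1:]
            if reduced in COMPOUND:
                return COMPOUND[reduced]
        return None                                # default condition

    print(match_compound((("HOW", "DO", "YOU", "COME"),
                          ("TO", "BE"), ("IN", "HOSPITAL"))))
    # -> ASK-WHY-HOSPITAL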
     If no match can be found at this point, the algorithm has arrived
at a default condition and the appropriate response functions decide
what to do.  For example, in a default condition the model may assume
control of the interview, asking the interviewer a question,
continuing with the topic under discussion, or introducing a new
topic.
     An annotated example of a diagnostic psychiatric interview is
presented in Appendix 1.


ADVANTAGES AND LIMITATIONS

     As mentioned, one of the main advantages of a pattern-matching
strategy is that it can ignore as irrelevant both some of what it
recognizes and everything it does not recognize at all.  There are
several million words in English, each possessing from one to over a
hundred senses.  To construct a machine-usable word dictionary of this
magnitude is out of the question at this time.  Recognizing natural
language input in the manner described above allows real-time
interaction in a dialogue, since it avoids becoming ensnarled in
combinatorial disambiguations and long chains of inference which would
slow a dialogue algorithm to impracticality, if it could function at
all.  The price paid for pattern-matching is that sometimes, though
rarely, ambiguities slip through.
     Another advantage of this method is its speed.  The algorithm
consists of about 28K of programs written in MLISP, 16K of data in
LISP, and 16K of data in machine language, with several overlays.  The
complete language-recognition process requires less than one second of
real time on a time-shared DEC PDP-10.
     A drawback of PARRY1 is that it reacts to the first pattern it
finds in the input, rather than characterizing the input as fully as
possible and then deciding what to do based on a number of tests.
Another practical difficulty with PARRY1, from a programmer's
viewpoint, is that, since it is a procedural model, elements of the
patterns are strung out in various procedures throughout the
algorithm.  It is often a considerable chore for the programmer to
determine whether a given pattern is present and precisely where it
is.  In PARRY2 the patterns are all collected in one part of the
data-base, where they can easily be examined.
     Concentrating all the patterns in the data-base gives PARRY2 a
limited "learning" ability.  When an input fails to match any stored
pattern, or matches an incorrect one as judged by a human operator, a
pattern which matches the input can be put into the data-base
automatically.  If the new pattern has the same meaning as a
previously stored pattern, the human operator must provide the name of
the appropriate response function.  If he does not remember the name,
he may try to rephrase the input in a form recognizable to PARRY2, and
it will name the response function associated with the rephrasing.
These mechanisms are not "learning" in the commonly used sense, but
they do allow a person to transfer his knowledge into PARRY2's
data-base with very little effort.
     Informal observation thus far shows PARRY2's linguistic
recognition abilities to be quite superior to PARRY1's.  A more
systematic and quantitative evaluation of performance is now being
carried out.  PARRY1 was extensively tested by having judges make
ratings of its performance along several dimensions, one of which was
linguistic noncomprehension (Colby and Hilf, 1974).  These judges also
made ratings of teletyped interviews with psychiatric patients and
with a random version of PARRY1.  The mean ratings of PARRY1 along the
dimension of linguistic noncomprehension were better than those
received by RANDOM-PARRY but were three times worse than the mean
ratings received by patients.  Once the ratings of PARRY2 along this
dimension are completed, we will be able to compare them with those of
PARRY1 and the patients and obtain a more objective measure of
improvement.


REFERENCES

Colby, K.M., Weber, S., and Hilf, F.D. (1971). Artificial Paranoia.
     ARTIFICIAL INTELLIGENCE, 2, 1-25.

Colby, K.M. and Hilf, F.D. (1974). Multidimensional Evaluation of a
     Computer Simulation of Paranoid Thought.  To appear in KNOWLEDGE
     AND COGNITION, L. Gregg (Ed.).

Wilks, Y. (1973). An Artificial Intelligence Approach to Machine
     Translation.  In COMPUTER MODELS OF THOUGHT AND LANGUAGE,
     R.C. Schank and K.M. Colby (Eds.), W.H. Freeman, San Francisco.

Winograd, T.A. (1972). A Program for Understanding Natural Language.
     COGNITIVE PSYCHOLOGY, 3, 1-191.

Woods, W.A. (1970). Transition Network Grammars for Natural Language
     Analysis.  COMMUNICATIONS OF THE ACM, 13, 591-606.